Morpheme Based Factored Language Models for German LVCSR
نویسندگان
چکیده
German is a highly inflectional language, where a large number of words can be generated from the same root. It makes a liberal use of compounding leading to high Out-of-vocabulary (OOV) rates, and poor Language Model (LM) probability estimates. Therefore, the use of morphemes for language modeling is considered a better choice for Large Vocabulary Continuous Speech Recognition (LVCSR) than the full-words. Thereby, better lexical coverage and less LM perplexities are achieved. On the other side, the use of Factored Language Models (FLMs) is considered a successful approach that allows the integration of many information sources to get better LM probability estimates. In this paper, we try a combined methodology for language modeling where both morphological decomposition and factored language modeling are used in one model called morpheme based FLM. Finally, we obtain around 2.5% relative reduction in Word Error Rate (WER) with respect to a traditional full-words system.
منابع مشابه
Morpheme Level Feature-based Language Models for German LVCSR
One of the challenges for Large Vocabulary Continuous Speech Recognition (LVCSR) of German is its complex morphology and high level of compounding. It leads to high Out-of-vocabulary (OOV) rates, and poor Language Model (LM) probabilities. In such cases, building LMs on morpheme level can be considered a better choice. Thereby, higher lexical coverage and lower LM perplexities are achieved. On ...
متن کاملFeature-rich sub-lexical language models using a maximum entropy approach for German LVCSR
German is a morphologically rich language having a high degree of word inflections, derivations and compounding. This leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities in the large vocabulary continuous speech recognition (LVCSR) systems. One of the main challenges in the German LVCSR is the recognition of the OOV words. For this purpose, data-driven morphem...
متن کاملMorpheme level hierarchical pitman-yor class-based language models for LVCSR of morphologically rich languages
Performing large vocabulary continuous speech recognition (LVCSR) for morphologically rich languages is considered a challenging task. The morphological richness of such languages leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities. In this case, the use of morphemes has been shown to increase the lexical coverage and lower the LM perplexity. Another approach ...
متن کاملMorpheme-Based Language Modeling for Amharic Speech Recognition
This paper presents the application of morpheme-based and factored language models in an Amharic speech recognition task. Since using morphemes in both acoustic and language models results, mostly, in performance degradation due to acoustic confusability and since it is problematic to use factored language models in standard word decoders, we applied the models in a lattice rescoring framework....
متن کاملTowards better language modeling for Thai LVCSR
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes us...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011